{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zFIDykq6tpY8"
   },
   "source": [
    "# App Search Engine exporter to Elasticsearch\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/enterprise-search/app-search-engine-exporter.ipynb)\n",
    "\n",
    "This notebook explains the steps of exporting an App Search engine together with its configurations in Elasticsearch. This is not meant to be an exhaustive example for all App Search features as those will vary based on your instance, but is meant to give a sense of how you can export, migrate, and enhance your application.\n",
    "\n",
    "We will look at:\n",
    "\n",
    "- [how to export synonyms](#export-app-search-synonyms-in-elasticsearch)\n",
    "- [how to export curations](#export-app-search-curations-in-elasticsearch)\n",
    "- [how to create a new index in Elasticsearch](#create-a-new-elasticsearch-index)\n",
    "- [how to add sparse vector fields](#add-sparse_vector-fields-for-semantic-search-optional)\n",
    "- [how to query the new Elasticsearch index](#query-the-new-elasticsearch-index)\n",
    "\n",
    "## Setup\n",
    "\n",
    "Let's start by making sure our Elasticsearch and Enterprise Search clients are installed. We'll also use `getpass` to ensure we can allow secure user inputs for our IDs and keys to access our Elasticsearch instance.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "6-d_3fBqtmRa",
    "outputId": "a8ee102e-eaaa-45b5-b883-edb06a5bed4f"
   },
   "outputs": [],
   "source": [
    "# install packages\n",
    "import sys\n",
    "\n",
    "!{sys.executable} -m pip install -qU elasticsearch elastic-enterprise-search\n",
    "\n",
    "# import modules\n",
    "from getpass import getpass\n",
    "from elastic_enterprise_search import AppSearch\n",
    "from elasticsearch import Elasticsearch\n",
    "import json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wsldWRz7xSiZ"
   },
   "source": [
    "## Connect to Elasticsearch\n",
    "\n",
    "ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. \n",
    "\n",
    "We'll use the **Cloud ID** to identify our deployment, because we are using Elastic Cloud deployment. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment. \n",
    "\n",
    "You will also need your **API KEY** to access your deployment. You can [create a new API key](https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key) from the `Stack Management -> API keys` menu in Kibana. Be sure to copy or write down your key in a safe place once it is created it will be displayed only upon creation.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n",
    "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
    "\n",
    "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n",
    "ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")\n",
    "\n",
    "elasticsearch = Elasticsearch(\n",
    "    # For local development\n",
    "    # hosts=[\"http://localhost:9200\"]\n",
    "    cloud_id=ELASTIC_CLOUD_ID,\n",
    "    api_key=ELASTIC_API_KEY,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Connect to App Search\n",
    "\n",
    "For this notebook we will need access to an App Search private key that can access the App Search engine we want to export.\n",
    "We will be instantiating the Enterprise Search client with the provided credentials and then check that we are correctly authenticated to Enterprise Search by getting the App Search engine details.\n",
    "\n",
    "You can find your App Search endpoint and your search private key from the `Credentials` menu inside your App Search instance in Kibana.\n",
    "\n",
    "Also note here, we define our `ENGINE_NAME`. For this examplem we are using the `national-parks-demo` sample engine that is available within App Search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "APP_SEARCH_ENDPOINT = getpass(\"App Search endpoint: \")\n",
    "APP_SEARCH_PRIVATE_KEY = getpass(\"App Search private key: \")\n",
    "\n",
    "app_search = AppSearch(APP_SEARCH_ENDPOINT, bearer_auth=APP_SEARCH_PRIVATE_KEY)\n",
    "\n",
    "# modify this with your own engine name\n",
    "ENGINE_NAME = \"national-parks-demo\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FSsSGl--uqFT"
   },
   "source": [
    "## Export App Search synonyms in Elasticsearch\n",
    "\n",
    "To get started with our export, we will first export any [synonyms](https://www.elastic.co/guide/en/app-search/current/synonyms-guide.html) we have in our App Search engine. \n",
    "\n",
    "The resulting synonyms will be placed into a new [Elasticsearch synoynm set](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-with-synonyms.html) named the same as our App Search egnine and used in analyzers for our synonyms-filter type later on.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "kpV8K5jHvRK6"
   },
   "outputs": [],
   "source": [
    "elasticsearch.synonyms.put_synonym(id=ENGINE_NAME, synonyms_set=[])\n",
    "\n",
    "for synonym_set in app_search.list_synonym_sets(engine_name=ENGINE_NAME).body[\n",
    "    \"results\"\n",
    "]:\n",
    "    elasticsearch.synonyms.put_synonym_rule(\n",
    "        set_id=ENGINE_NAME,\n",
    "        rule_id=synonym_set[\"id\"],\n",
    "        synonyms=\", \".join(synonym_set[\"synonyms\"]),\n",
    "    )"
   ]
  },
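  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal offline sketch of the conversion performed above: App Search returns each synonym set as a list of terms, while an Elasticsearch synonym rule takes a single comma-separated string. The helper name `to_synonym_rules` and the sample data are hypothetical and exist only for illustration.\n",
    "\n",
    "```python\n",
    "def to_synonym_rules(synonym_sets):\n",
    "    # Convert App Search synonym sets into Elasticsearch synonym rules\n",
    "    return [\n",
    "        {\"id\": synonym_set[\"id\"], \"synonyms\": \", \".join(synonym_set[\"synonyms\"])}\n",
    "        for synonym_set in synonym_sets\n",
    "    ]\n",
    "\n",
    "\n",
    "# Hypothetical sample shaped like the \"results\" list returned by App Search\n",
    "sample_synonym_sets = [\n",
    "    {\"id\": \"syn-1\", \"synonyms\": [\"park\", \"trail\"]},\n",
    "    {\"id\": \"syn-2\", \"synonyms\": [\"mountain\", \"peak\", \"summit\"]},\n",
    "]\n",
    "\n",
    "synonym_rules = to_synonym_rules(sample_synonym_sets)\n",
    "print(synonym_rules)\n",
    "```\n",
    "\n",
    "A list in this shape could also be passed as `synonyms_set` to a single `put_synonym` call instead of adding the rules one at a time."
   ]
  },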
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lyc-DcjnvTgH"
   },
   "source": [
    "## Export App Search curations in Elasticsearch\n",
    "\n",
    "Next, we will export any curations that may be in our App Search engine.\n",
    "\n",
    "To export App Search curations we will use Elasticsearch [query rules](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-using-query-rules.html).\n",
    "At the moment of writing this notebook Elasticsearch query rules only allow for pinning results unlike App Search curations that also allow excluding results.\n",
    "For this reason we will only export pinned results. The code below will create the necessary `query_rules` to achieve this. Note that there is a default soft limit of 100 curations for `query_rules` that can be configured up to a hard limit of 1,000."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query_rules = []\n",
    "\n",
    "for curation in app_search.list_curations(engine_name=ENGINE_NAME).body[\"results\"]:\n",
    "    query_rules.append(\n",
    "        {\n",
    "            \"rule_id\": curation[\"id\"],\n",
    "            \"type\": \"pinned\",\n",
    "            \"criteria\": [\n",
    "                {\n",
    "                    \"type\": \"exact\",\n",
    "                    \"metadata\": \"user_query\",\n",
    "                    \"values\": curation[\"queries\"],\n",
    "                }\n",
    "            ],\n",
    "            \"actions\": {\"ids\": curation[\"promoted\"]},\n",
    "        }\n",
    "    )\n",
    "\n",
    "\n",
    "elasticsearch.query_ruleset.put(ruleset_id=ENGINE_NAME, rules=query_rules)"
   ]
  },
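  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the shape of a converted curation easier to inspect without a running cluster, here is a small sketch of the same conversion applied to one made-up curation. The helper name `curation_to_rule`, the curation ID, and the document IDs are assumptions for illustration only.\n",
    "\n",
    "```python\n",
    "def curation_to_rule(curation):\n",
    "    # Build a pinned query rule from an App Search curation (promoted documents only)\n",
    "    return {\n",
    "        \"rule_id\": curation[\"id\"],\n",
    "        \"type\": \"pinned\",\n",
    "        \"criteria\": [\n",
    "            {\"type\": \"exact\", \"metadata\": \"user_query\", \"values\": curation[\"queries\"]}\n",
    "        ],\n",
    "        \"actions\": {\"ids\": curation[\"promoted\"]},\n",
    "    }\n",
    "\n",
    "\n",
    "# Hypothetical curation; the hidden documents are not carried over to the rule\n",
    "sample_curation = {\n",
    "    \"id\": \"cur-1\",\n",
    "    \"queries\": [\"mountains\", \"hills\"],\n",
    "    \"promoted\": [\"park_rocky-mountain\"],\n",
    "    \"hidden\": [\"park_saguaro\"],\n",
    "}\n",
    "\n",
    "rule = curation_to_rule(sample_curation)\n",
    "print(rule)\n",
    "```\n",
    "\n",
    "Curations that only hide documents end up with an empty `promoted` list, so you may want to skip them before creating the ruleset."
   ]
  },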
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yOKVxmSbvbt9"
   },
   "source": [
    "## Create a new Elasticsearch index\n",
    "\n",
    "While we could re-use the same Elasticsearch index that is storing the App Search engine documents, reindexing the data in a new index will allow us to change the mapping to use features like semantic search or to be able to use the Elasticsearch synonym set we just created.\n",
    "\n",
    "App Search has the following data types: text, number, date and geolocation. Each of these types is mapped to Elasticsearch field types.\n",
    "We can take a closer look at how App Search field types are mapped to Elasticsearch fields, by using the [`GET mapping API`](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-get-mapping.html).\n",
    "For App Search engines, the associated Elasticsearch index name is `.ent-search-engine-documents-[ENGINE_NAME]`, e.g. `.ent-search-engine-documents-national-parks-demo` for the App Search sample engine `national-parks-demo`.\n",
    "One thing to notice is how App Search uses [multi-fields](https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html) in Elasticsearch that allow for quickly changing the field type in App Search without requiring reindexing by creating subfields for each type of supported field:\n",
    "\n",
    "<details>\n",
    "  <summary>An example schema can be found by clicking here</summary>\n",
    "\n",
    "```json\n",
    "\"[APP_SEARCH_ENGINE_FIELD_NAME]\": {\n",
    "  \"type\": \"text\",\n",
    "  \"fields\": {\n",
    "    \"date\": {\n",
    "      \"type\": \"date\",\n",
    "      \"format\": \"strict_date_time||strict_date\",\n",
    "      \"ignore_malformed\": true\n",
    "    },\n",
    "    \"delimiter\": {\n",
    "      \"type\": \"text\",\n",
    "      \"index_options\": \"freqs\",\n",
    "      \"analyzer\": \"iq_text_delimiter\"\n",
    "    },\n",
    "    \"enum\": {\n",
    "      \"type\": \"keyword\",\n",
    "      \"ignore_above\": 2048\n",
    "    },\n",
    "    \"float\": {\n",
    "      \"type\": \"double\",\n",
    "      \"ignore_malformed\": true\n",
    "    },\n",
    "    \"joined\": {\n",
    "      \"type\": \"text\",\n",
    "      \"index_options\": \"freqs\",\n",
    "      \"analyzer\": \"i_text_bigram\",\n",
    "      \"search_analyzer\": \"q_text_bigram\"\n",
    "    },\n",
    "    \"location\": {\n",
    "      \"type\": \"geo_point\",\n",
    "      \"ignore_malformed\": true,\n",
    "      \"ignore_z_value\": false\n",
    "    },\n",
    "    \"prefix\": {\n",
    "      \"type\": \"text\",\n",
    "      \"index_options\": \"docs\",\n",
    "      \"analyzer\": \"i_prefix\",\n",
    "      \"search_analyzer\": \"q_prefix\"\n",
    "    },\n",
    "    \"stem\": {\n",
    "      \"type\": \"text\",\n",
    "      \"analyzer\": \"iq_text_stem\"\n",
    "    }\n",
    "  },\n",
    "  \"index_options\": \"freqs\",\n",
    "  \"analyzer\": \"iq_text_base\"\n",
    "}\n",
    "```\n",
    "</details>\n",
    "\n",
    "In our case we can assume that we have a well established schema and we do not need to use all multi-fields.\n",
    "\n",
    "We can retrieve the field types of an App Search engine using the [Schema API](https://www.elastic.co/guide/en/app-search/current/schema.html) and then construct our mapping.\n",
    "\n",
    "Also note that below, we set up variables for our `SOURCE_INDEX` and `DEST_INDEX`. If you want your destination index to be named differently, you can edit it here as these variables are used throughout the rest of the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define SOURCE_INDEX and DEST_INDEX which we will continue to reuse; feel free to adjust DEST_INDEX\n",
    "SOURCE_INDEX = \".ent-search-engine-documents-\" + ENGINE_NAME\n",
    "DEST_INDEX = \"new-\" + ENGINE_NAME\n",
    "\n",
    "# delete the index if it's already created\n",
    "if elasticsearch.indices.exists(index=DEST_INDEX):\n",
    "    elasticsearch.indices.delete(index=DEST_INDEX)\n",
    "\n",
    "# get the App Search engine schema\n",
    "schema = app_search.get_schema(engine_name=ENGINE_NAME)\n",
    "\n",
    "# construct the Elasticsearch mapping\n",
    "mapping = {}\n",
    "\n",
    "for field_name in schema:\n",
    "    field_type = schema[field_name]\n",
    "\n",
    "    if field_type == \"date\":\n",
    "        mapping[field_name] = {\n",
    "            \"type\": \"date\",\n",
    "            \"format\": \"strict_date_time||strict_date\",\n",
    "            \"ignore_malformed\": True,\n",
    "        }\n",
    "    elif field_type == \"location\":\n",
    "        mapping[field_name] = {\"type\": \"geo_point\", \"ignore_z_value\": False}\n",
    "    elif field_type == \"number\":\n",
    "        mapping[field_name] = {\"type\": \"double\"}\n",
    "    elif field_type == \"text\":\n",
    "        # feel free to modify this with your own mapping for text fields\n",
    "        mapping[field_name] = {\n",
    "            \"fields\": {\n",
    "                \"keyword\": {\"type\": \"keyword\", \"ignore_above\": 2048},\n",
    "                \"delimiter\": {\n",
    "                    \"type\": \"text\",\n",
    "                    \"index_options\": \"freqs\",\n",
    "                    \"analyzer\": \"iq_text_delimiter\",\n",
    "                },\n",
    "                \"joined\": {\n",
    "                    \"type\": \"text\",\n",
    "                    \"index_options\": \"freqs\",\n",
    "                    \"analyzer\": \"i_text_bigram\",\n",
    "                    \"search_analyzer\": \"q_text_bigram\",\n",
    "                },\n",
    "                \"prefix\": {\n",
    "                    \"type\": \"text\",\n",
    "                    \"index_options\": \"docs\",\n",
    "                    \"analyzer\": \"i_prefix\",\n",
    "                    \"search_analyzer\": \"q_prefix\",\n",
    "                },\n",
    "                \"stem\": {\"type\": \"text\", \"analyzer\": \"iq_text_stem\"},\n",
    "            },\n",
    "            \"type\": \"text\",\n",
    "            \"index_options\": \"freqs\",\n",
    "            \"analyzer\": \"i_text_base\",\n",
    "            \"search_analyzer\": \"q_text_base\",\n",
    "        }\n",
    "\n",
    "# These are similar to the Elasticsearch analyzers we use for App Search.\n",
    "# The main difference is that we are also adding a synonyms filter so that we can\n",
    "# leverage the Elasticsearch synonym set we created in a previous step.\n",
    "# If you want a different mapping for text fields, feel free to modify.\n",
    "settings = {\n",
    "    \"analysis\": {\n",
    "        \"filter\": {\n",
    "            \"front_ngram\": {\"type\": \"edge_ngram\", \"min_gram\": \"1\", \"max_gram\": \"12\"},\n",
    "            \"bigram_joiner\": {\n",
    "                \"max_shingle_size\": \"2\",\n",
    "                \"token_separator\": \"\",\n",
    "                \"output_unigrams\": \"false\",\n",
    "                \"type\": \"shingle\",\n",
    "            },\n",
    "            \"bigram_max_size\": {\"type\": \"length\", \"max\": \"16\", \"min\": \"0\"},\n",
    "            \"en-stem-filter\": {\"name\": \"light_english\", \"type\": \"stemmer\"},\n",
    "            \"bigram_joiner_unigrams\": {\n",
    "                \"max_shingle_size\": \"2\",\n",
    "                \"token_separator\": \"\",\n",
    "                \"output_unigrams\": \"true\",\n",
    "                \"type\": \"shingle\",\n",
    "            },\n",
    "            \"delimiter\": {\n",
    "                \"split_on_numerics\": \"true\",\n",
    "                \"generate_word_parts\": \"true\",\n",
    "                \"preserve_original\": \"false\",\n",
    "                \"catenate_words\": \"true\",\n",
    "                \"generate_number_parts\": \"true\",\n",
    "                \"catenate_all\": \"true\",\n",
    "                \"split_on_case_change\": \"true\",\n",
    "                \"type\": \"word_delimiter_graph\",\n",
    "                \"catenate_numbers\": \"true\",\n",
    "                \"stem_english_possessive\": \"true\",\n",
    "            },\n",
    "            \"en-stop-words-filter\": {\"type\": \"stop\", \"stopwords\": \"_english_\"},\n",
    "            \"synonyms-filter\": {\n",
    "                \"type\": \"synonym_graph\",\n",
    "                \"synonyms_set\": ENGINE_NAME,\n",
    "                \"updateable\": True,\n",
    "            },\n",
    "        },\n",
    "        \"analyzer\": {\n",
    "            \"i_prefix\": {\n",
    "                \"filter\": [\"cjk_width\", \"lowercase\", \"asciifolding\", \"front_ngram\"],\n",
    "                \"tokenizer\": \"standard\",\n",
    "            },\n",
    "            \"iq_text_delimiter\": {\n",
    "                \"filter\": [\n",
    "                    \"delimiter\",\n",
    "                    \"cjk_width\",\n",
    "                    \"lowercase\",\n",
    "                    \"asciifolding\",\n",
    "                    \"en-stop-words-filter\",\n",
    "                    \"en-stem-filter\",\n",
    "                ],\n",
    "                \"tokenizer\": \"whitespace\",\n",
    "            },\n",
    "            \"q_prefix\": {\n",
    "                \"filter\": [\"cjk_width\", \"lowercase\", \"asciifolding\"],\n",
    "                \"tokenizer\": \"standard\",\n",
    "            },\n",
    "            \"i_text_base\": {\n",
    "                \"filter\": [\n",
    "                    \"cjk_width\",\n",
    "                    \"lowercase\",\n",
    "                    \"asciifolding\",\n",
    "                    \"en-stop-words-filter\",\n",
    "                ],\n",
    "                \"tokenizer\": \"standard\",\n",
    "            },\n",
    "            \"q_text_base\": {\n",
    "                \"filter\": [\n",
    "                    \"cjk_width\",\n",
    "                    \"lowercase\",\n",
    "                    \"asciifolding\",\n",
    "                    \"en-stop-words-filter\",\n",
    "                    \"synonyms-filter\",\n",
    "                ],\n",
    "                \"tokenizer\": \"standard\",\n",
    "            },\n",
    "            \"iq_text_stem\": {\n",
    "                \"filter\": [\n",
    "                    \"cjk_width\",\n",
    "                    \"lowercase\",\n",
    "                    \"asciifolding\",\n",
    "                    \"en-stop-words-filter\",\n",
    "                    \"en-stem-filter\",\n",
    "                ],\n",
    "                \"tokenizer\": \"standard\",\n",
    "            },\n",
    "            \"i_text_bigram\": {\n",
    "                \"filter\": [\n",
    "                    \"cjk_width\",\n",
    "                    \"lowercase\",\n",
    "                    \"asciifolding\",\n",
    "                    \"en-stem-filter\",\n",
    "                    \"bigram_joiner\",\n",
    "                    \"bigram_max_size\",\n",
    "                ],\n",
    "                \"tokenizer\": \"standard\",\n",
    "            },\n",
    "            \"q_text_bigram\": {\n",
    "                \"filter\": [\n",
    "                    \"cjk_width\",\n",
    "                    \"lowercase\",\n",
    "                    \"asciifolding\",\n",
    "                    \"synonyms-filter\",\n",
    "                    \"en-stem-filter\",\n",
    "                    \"bigram_joiner_unigrams\",\n",
    "                    \"bigram_max_size\",\n",
    "                ],\n",
    "                \"tokenizer\": \"standard\",\n",
    "            },\n",
    "        },\n",
    "    }\n",
    "}\n",
    "\n",
    "# and actually create our index\n",
    "elasticsearch.indices.create(\n",
    "    index=DEST_INDEX, mappings={\"properties\": mapping}, settings=settings\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Add `sparse_vector` fields for semantic search (optional)\n",
    "\n",
    "One of the advantages of having our exported index directly in Elasticsearch is that we can easily take advantage of doing semantic search with ELSER. To do this, we'll need to add a `sparse_vector` field to our index, set up an ingest pipeline, and reindex our data.\n",
    "\n",
    "Note that to use this feature, your cluster must have at least one ML node set up with enough resources allocated to it.\n",
    "\n",
    "Let's first start by adding `sparse_vector` fields to our new index mapping."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# by default we are adding a `sparse_vector` field for all text fields in our engine\n",
    "# feel free to modify this list to only include the fields that are relevant\n",
    "SPARSE_VECTOR_FIELDS = [\n",
    "    field_name + \"_semantic\" for field_name in schema if schema[field_name] == \"text\"\n",
    "]\n",
    "\n",
    "sparse_vector_fields = {}\n",
    "for field_name in SPARSE_VECTOR_FIELDS:\n",
    "    # this is added so we can use semantic search with ELSER\n",
    "    sparse_vector_fields[field_name] = {\"type\": \"sparse_vector\"}\n",
    "\n",
    "elasticsearch.indices.put_mapping(index=DEST_INDEX, properties=sparse_vector_fields)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup an ingest pipeline using ELSER\n",
    "\n",
    "> If you have not already deployed ELSER, follow this [guide](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html) on how to download and deploy the model. Without this step, you will receive errors below when you run the `reindex` command.\n",
    "\n",
    "Assuming you have downloaded and deployed ELSER in your deployment, we can now define an ingest pipeline that will enrich the documents with the `sparse_vector` fields that can be used with semantic search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PIPELINE = \"elser-ingest-pipeline-\" + ENGINE_NAME\n",
    "\n",
    "processors = []\n",
    "\n",
    "for output_field in SPARSE_VECTOR_FIELDS:\n",
    "    input_field = output_field.removesuffix(\"_semantic\")\n",
    "    processors.append(\n",
    "        {\n",
    "            \"inference\": {\n",
    "                \"model_id\": \".elser_model_2\",\n",
    "                \"input_output\": [\n",
    "                    {\"input_field\": input_field, \"output_field\": output_field}\n",
    "                ],\n",
    "                \"on_failure\": [\n",
    "                    {\n",
    "                        \"append\": {\n",
    "                            \"field\": \"_source._ingest.inference_errors\",\n",
    "                            \"allow_duplicates\": False,\n",
    "                            \"value\": [\n",
    "                                {\n",
    "                                    \"message\": \"Processor failed for field '\"\n",
    "                                    + input_field\n",
    "                                    + \"' with message '{{ _ingest.on_failure_message }}'\",\n",
    "                                    \"timestamp\": \"{{{ _ingest.timestamp }}}\",\n",
    "                                }\n",
    "                            ],\n",
    "                        }\n",
    "                    }\n",
    "                ],\n",
    "            }\n",
    "        }\n",
    "    )\n",
    "\n",
    "# create the ingest pipeline\n",
    "elasticsearch.ingest.put_pipeline(\n",
    "    id=PIPELINE, description=\"Ingest pipeline for ELSER\", processors=processors\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reindex the data\n",
    "Now that we have created the Elasticsearch index and the ingest pipeline, it's time to reindex our data in the new index. The pipeline definition we created above will create a field for each of the `SPARSE_VECTOR_FIELDS` we defined with a `_semantic` suffix, and then infer the sparse vector values from ELSER as the reindex takes place."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reindex_task = elasticsearch.reindex(\n",
    "    source={\"index\": SOURCE_INDEX},\n",
    "    dest={\"index\": DEST_INDEX, \"pipeline\": PIPELINE},\n",
    "    wait_for_completion=False,\n",
    ")\n",
    "\n",
    "task_id = reindex_task[\"task\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that above in the reindex command, we set `wait_for_completion` to false. Inference can possibly take a while and we might run the risk of our command timing out.\n",
    "The call above will return a task that we can watch and see its progress the the `tasks` endpoint:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "task_response = elasticsearch.tasks.get(task_id=task_id)\n",
    "print(json.dumps(task_response.body, indent=2))"
   ]
  },
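  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rather than re-running the cell above by hand, the task can be polled until it reports `completed`. The sketch below is one possible approach, not part of the original export flow: `wait_for_task`, `poll_seconds`, and `max_polls` are hypothetical names, and a fake task getter stands in for `elasticsearch.tasks.get` so the example runs without a cluster.\n",
    "\n",
    "```python\n",
    "import time\n",
    "\n",
    "\n",
    "def wait_for_task(get_task, task_id, poll_seconds=5.0, max_polls=120):\n",
    "    # Poll the task until it reports completion, then return the last response\n",
    "    for _ in range(max_polls):\n",
    "        response = get_task(task_id)\n",
    "        if response[\"completed\"]:\n",
    "            return response\n",
    "        time.sleep(poll_seconds)\n",
    "    raise TimeoutError(f\"Task {task_id} did not complete after {max_polls} polls\")\n",
    "\n",
    "\n",
    "# Stand-in for elasticsearch.tasks.get: reports completion on the third call\n",
    "fake_responses = iter([{\"completed\": False}, {\"completed\": False}, {\"completed\": True}])\n",
    "\n",
    "\n",
    "def fake_get_task(task_id):\n",
    "    return next(fake_responses)\n",
    "\n",
    "\n",
    "final_response = wait_for_task(fake_get_task, \"node:1\", poll_seconds=0)\n",
    "print(final_response)\n",
    "```\n",
    "\n",
    "Against a real cluster, the getter could be `lambda task_id: elasticsearch.tasks.get(task_id=task_id)`."
   ]
  },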
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Query the new Elasticsearch index\n",
    "\n",
    "We will exemplify:\n",
    "\n",
    "- [how to replicate App Search queries](#how-to-build-app-search-like-queries)\n",
    "- [how to do semantic search using ELSER](#how-to-do-semantic-search-using-elser)\n",
    "- [how to combine App Search queries and ELSER](#how-to-combine-app-search-queries-with-elser)\n",
    "\n",
    "### How to build App Search like queries\n",
    "\n",
    "App Search exposes a [search_explain API](https://www.elastic.co/guide/en/app-search/current/search-explain.html) that receives an App Search query and returns the Elasticsearch query built by App Search.\n",
    "\n",
    "```bash\n",
    "curl -X POST '${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/national-parks-demo/search_explain' \\\n",
    "-H 'Content-Type: application/json' \\\n",
    "-H 'Authorization: Bearer private-xxxxxx' \\\n",
    "-d '{\n",
    "  \"query\": \"park\"\n",
    "}'\n",
    "```\n",
    "\n",
    "From the output of the API call above, we can see the actual Elasticsearch query that will be used. Below, we are using this query as a base to build our own App Search like query using query rules and our Elasticsearch synonyms. The query is further enhanced by augmentation with the built-in App Search multifield types for such things as stemming and prefix matching."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "QUERY_STRING = \"park\"\n",
    "\n",
    "result_fields = list(schema.keys())\n",
    "\n",
    "text_fields = [field_name for field_name in schema if schema[field_name] == \"text\"]\n",
    "best_fields = [field_name + \".stem\" for field_name in text_fields]\n",
    "\n",
    "cross_fields = []\n",
    "\n",
    "for text_field in text_fields:\n",
    "    cross_fields.append(text_field + \"^1.0\")\n",
    "    cross_fields.append(text_field + \".stem^0.95\")\n",
    "    cross_fields.append(text_field + \".prefix^0.1\")\n",
    "    cross_fields.append(text_field + \".joined^0.75\")\n",
    "    cross_fields.append(text_field + \".delimiter^0.4\")\n",
    "\n",
    "app_search_query_payload = {\n",
    "    \"query\": {\n",
    "        \"rule_query\": {\n",
    "            \"organic\": {\n",
    "                \"bool\": {\n",
    "                    \"should\": [\n",
    "                        {\n",
    "                            \"multi_match\": {\n",
    "                                \"query\": QUERY_STRING,\n",
    "                                \"minimum_should_match\": \"1<-1 3<49%\",\n",
    "                                \"type\": \"cross_fields\",\n",
    "                                \"fields\": cross_fields,\n",
    "                            }\n",
    "                        },\n",
    "                        {\n",
    "                            \"multi_match\": {\n",
    "                                \"query\": QUERY_STRING,\n",
    "                                \"minimum_should_match\": \"1<-1 3<49%\",\n",
    "                                \"type\": \"best_fields\",\n",
    "                                \"fuzziness\": \"AUTO\",\n",
    "                                \"prefix_length\": 2,\n",
    "                                \"fields\": best_fields,\n",
    "                            }\n",
    "                        },\n",
    "                    ]\n",
    "                }\n",
    "            },\n",
    "            \"ruleset_id\": ENGINE_NAME,\n",
    "            \"match_criteria\": {\"user_query\": QUERY_STRING},\n",
    "        }\n",
    "    },\n",
    "    \"sort\": [{\"_score\": \"desc\"}, {\"_doc\": \"desc\"}],\n",
    "    \"highlight\": {\n",
    "        \"fragment_size\": 300,\n",
    "        \"type\": \"plain\",\n",
    "        \"number_of_fragments\": 1,\n",
    "        \"order\": \"score\",\n",
    "        \"encoder\": \"html\",\n",
    "        \"require_field_match\": False,\n",
    "        \"fields\": {},\n",
    "    },\n",
    "    \"size\": 10,\n",
    "    \"_source\": result_fields,\n",
    "}\n",
    "\n",
    "print(f\"Elasticsearch payload:\\n{json.dumps(app_search_query_payload, indent=2)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have our fully flushed out query, we can use that to perform the actual search:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = elasticsearch.search(\n",
    "    index=SOURCE_INDEX,\n",
    "    query=app_search_query_payload[\"query\"],\n",
    "    highlight=app_search_query_payload[\"highlight\"],\n",
    "    source=app_search_query_payload[\"_source\"],\n",
    "    sort=app_search_query_payload[\"sort\"],\n",
    "    size=app_search_query_payload[\"size\"],\n",
    ")\n",
    "print(json.dumps(results.body, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How to do semantic search using ELSER\n",
    "\n",
    "If you [enabled and reindexed your data with ELSER](#add-sparse_vector-fields-for-semantic-search-optional), we can now use this to do semantic search.\n",
    "For each `spare_vector` we will generate a `text_expansion` query. These `text_expansion` queries will be added as `should` clauses to a top-level `bool` query.\n",
    "We also use `min_score` because we want to exclude less relevant results. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# replace with your own\n",
    "QUERY_STRING = \"Which national park has dangerous wild animals?\"\n",
    "text_expansion_queries = []\n",
    "\n",
    "for field_name in SPARSE_VECTOR_FIELDS:\n",
    "    text_expansion_queries.append(\n",
    "        {\n",
    "            \"text_expansion\": {\n",
    "                field_name: {\"model_id\": \".elser_model_2\", \"model_text\": QUERY_STRING}\n",
    "            }\n",
    "        }\n",
    "    )\n",
    "\n",
    "semantic_query = {\"bool\": {\"should\": text_expansion_queries}}\n",
    "print(f\"Elasticsearch query:\\n{json.dumps(semantic_query, indent=2)}\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = elasticsearch.search(index=DEST_INDEX, query=semantic_query, min_score=20)\n",
    "print(f\"Query results:\\n{json.dumps(results.body, indent=2)}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How to combine App Search queries with ELSER\n",
    "\n",
    "We will now provide an example on how to combine the previous two queries into a single query that applies both BM25 search and semantic search.\n",
    "In the previous examples, we have a `bool` query with `should` clauses.\n",
    "We will combine them in a single `bool` query and wrap this `bool` query in a `rule_query`.\n",
    "The `rule_query` is used to pin results based on the query string, similarly to App Search curations.\n",
    "The high-level structure of the query is following:\n",
    "\n",
    "```json\n",
    "GET [DEST-INDEX]\n",
    "{\n",
    "  \"query\": {\n",
    "    \"rule_query\": {\n",
    "      \"organic\": {\n",
    "        \"bool\": {\n",
    "          \"should\": [\n",
    "            // multi_match query with best_fields from App Search generated query\n",
    "            // multi_match query with cross_fields from App Search generated query\n",
    "            // text_expansion queries for sparse_vector fields\n",
    "          ]\n",
    "        }\n",
    "      }  \n",
    "    }\n",
    "  }\n",
    "}\n",
    "```\n",
    "\n",
    "We are again using `min_score` to exclude less relevant results.\n",
    "In our example we are not boosting any of the `should` clauses, but this can be a way to boost ELSER results over BM25 results."
   ]
  },
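  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The boosting idea mentioned above can be sketched as follows. This is only one possible approach: it groups the `text_expansion` clauses in a nested `bool` query and sets `boost` on that wrapper. The helper name `boost_clauses`, the field name `description_semantic`, and the boost value `2.0` are assumptions chosen for illustration.\n",
    "\n",
    "```python\n",
    "def boost_clauses(clauses, boost):\n",
    "    # Wrap query clauses in a nested bool query that carries a boost\n",
    "    return {\"bool\": {\"should\": clauses, \"boost\": boost}}\n",
    "\n",
    "\n",
    "# Hypothetical text_expansion clause for a single sparse_vector field\n",
    "sample_text_expansion_queries = [\n",
    "    {\n",
    "        \"text_expansion\": {\n",
    "            \"description_semantic\": {\n",
    "                \"model_id\": \".elser_model_2\",\n",
    "                \"model_text\": \"Which national park has dangerous wild animals?\",\n",
    "            }\n",
    "        }\n",
    "    }\n",
    "]\n",
    "\n",
    "boosted_semantic_clause = boost_clauses(sample_text_expansion_queries, boost=2.0)\n",
    "print(boosted_semantic_clause)\n",
    "```\n",
    "\n",
    "The wrapped clause could then be appended to the outer `should` list in place of the individual `text_expansion` queries; a suitable boost value depends on your data and is best found by experimenting."
   ]
  },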
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "payload = app_search_query_payload.copy()\n",
    "\n",
    "for text_expansion_query in text_expansion_queries:\n",
    "    payload[\"query\"][\"rule_query\"][\"organic\"][\"bool\"][\"should\"].append(\n",
    "        text_expansion_query\n",
    "    )\n",
    "\n",
    "print(f\"Elasticsearch payload:\\n{json.dumps(payload, indent=2)}\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = elasticsearch.search(\n",
    "    index=SOURCE_INDEX,\n",
    "    query=payload[\"query\"],\n",
    "    highlight=payload[\"highlight\"],\n",
    "    source=payload[\"_source\"],\n",
    "    sort=payload[\"sort\"],\n",
    "    size=payload[\"size\"],\n",
    "    min_score=1,\n",
    ")\n",
    "\n",
    "print(f\"Text expansion query results:\\n{json.dumps(results.body, indent=2)}\\n\")"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
